16 research outputs found
Scaling Data-Constrained Language Models
The current trend of scaling language models involves increasing both
parameter count and training dataset size. Extrapolating this trend suggests
that training dataset size may soon be limited by the amount of text data
available on the internet. Motivated by this limit, we investigate scaling
language models in data-constrained regimes. Specifically, we run a large set
of experiments varying the extent of data repetition and compute budget,
ranging up to 900 billion training tokens and 9 billion parameter models. We
find that with constrained data for a fixed compute budget, training with up to
4 epochs of repeated data yields negligible changes to loss compared to having
unique data. However, with more repetition, the value of adding compute
eventually decays to zero. We propose and empirically validate a scaling law
for compute optimality that accounts for the decreasing value of repeated
tokens and excess parameters. Finally, we experiment with approaches mitigating
data scarcity, including augmenting the training dataset with code data or
removing commonly used filters. Models and datasets from our 400 training runs
are freely available at https://github.com/huggingface/datablations.Comment: 47 pages (9 main), 37 figures, 13 table
OctoPack: Instruction Tuning Code Large Language Models
Finetuning large language models (LLMs) on instructions leads to vast
performance improvements on natural language tasks. We apply instruction tuning
using code, leveraging the natural structure of Git commits, which pair code
changes with human instructions. We compile CommitPack: 4 terabytes of Git
commits across 350 programming languages. We benchmark CommitPack against other
natural and synthetic code instructions (xP3x, Self-Instruct, OASST) on the 16B
parameter StarCoder model, and achieve state-of-the-art performance among
models not trained on OpenAI outputs, on the HumanEval Python benchmark (46.2%
pass@1). We further introduce HumanEvalPack, expanding the HumanEval benchmark
to a total of 3 coding tasks (Code Repair, Code Explanation, Code Synthesis)
across 6 languages (Python, JavaScript, Java, Go, C++, Rust). Our models,
OctoCoder and OctoGeeX, achieve the best performance across HumanEvalPack among
all permissive models, demonstrating CommitPack's benefits in generalizing to a
wider set of languages and natural coding tasks. Code, models and data are
freely available at https://github.com/bigcode-project/octopack.Comment: 57 pages (9 main), 39 figures, 16 table
BLOOM+1: Adding Language Support to BLOOM for Zero-Shot Prompting
The BLOOM model is a large publicly available multilingual language model,
but its pretraining was limited to 46 languages. To extend the benefits of
BLOOM to other languages without incurring prohibitively large costs, it is
desirable to adapt BLOOM to new languages not seen during pretraining. In this
work, we apply existing language adaptation strategies to BLOOM and benchmark
its zero-shot prompting performance on eight new languages in a
resource-constrained setting. We find language adaptation to be effective at
improving zero-shot performance in new languages. Surprisingly, we find that
adapter-based finetuning is more effective than continued pretraining for large
models. In addition, we discover that prompting performance is not
significantly affected by language specifics, such as the writing system. It is
primarily determined by the size of the language adaptation data. We also add
new languages to BLOOMZ, which is a multitask finetuned version of BLOOM
capable of following task instructions zero-shot. We find including a new
language in the multitask fine-tuning mixture to be the most effective method
to teach BLOOMZ a new language. We conclude that with sufficient training data
language adaptation can generalize well to diverse languages. Our code is
available at https://github.com/bigscience-workshop/multilingual-modeling.Comment: ACL 202
What Language Model to Train if You Have One Million GPU Hours?
The crystallization of modeling methods around the Transformer architecture
has been a boon for practitioners. Simple, well-motivated architectural
variations can transfer across tasks and scale, increasing the impact of
modeling research. However, with the emergence of state-of-the-art 100B+
parameters models, large language models are increasingly expensive to
accurately design and train. Notably, it can be difficult to evaluate how
modeling decisions may impact emergent capabilities, given that these
capabilities arise mainly from sheer scale alone. In the process of building
BLOOM--the Big Science Large Open-science Open-access Multilingual language
model--our goal is to identify an architecture and training setup that makes
the best use of our 1,000,000 A100-GPU-hours budget. Specifically, we perform
an ablation study at the billion-parameter scale comparing different modeling
practices and their impact on zero-shot generalization. In addition, we study
the impact of various popular pre-training corpora on zero-shot generalization.
We also study the performance of a multilingual model and how it compares to
the English-only one. Finally, we consider the scaling behaviour of
Transformers to choose the target model size, shape, and training setup. All
our models and code are open-sourced at https://huggingface.co/bigscience .Comment: Findings of EMNLP 202
Crosslingual Generalization through Multitask Finetuning
Multitask prompted finetuning (MTF) has been shown to help large language
models generalize to new tasks in a zero-shot setting, but so far explorations
of MTF have focused on English data and models. We apply MTF to the pretrained
multilingual BLOOM and mT5 model families to produce finetuned variants called
BLOOMZ and mT0. We find finetuning large multilingual language models on
English tasks with English prompts allows for task generalization to
non-English languages that appear only in the pretraining corpus. Finetuning on
multilingual tasks with English prompts further improves performance on English
and non-English tasks leading to various state-of-the-art zero-shot results. We
also investigate finetuning on multilingual tasks with prompts that have been
machine-translated from English to match the language of each dataset. We find
training on these machine-translated prompts leads to better performance on
human-written prompts in the respective languages. Surprisingly, we find models
are capable of zero-shot generalization to tasks in languages they have never
intentionally seen. We conjecture that the models are learning higher-level
capabilities that are both task- and language-agnostic. In addition, we
introduce xP3, a composite of supervised datasets in 46 languages with English
and machine-translated prompts. Our code, datasets and models are freely
available at https://github.com/bigscience-workshop/xmtf.Comment: 9 main pages (119 with appendix), 16 figures and 11 table
FinGPT: Large Generative Models for a Small Language
Large language models (LLMs) excel in many tasks in NLP and beyond, but most
open models have very limited coverage of smaller languages and LLM work tends
to focus on languages where nearly unlimited data is available for pretraining.
In this work, we study the challenges of creating LLMs for Finnish, a language
spoken by less than 0.1% of the world population. We compile an extensive
dataset of Finnish combining web crawls, news, social media and eBooks. We
pursue two approaches to pretrain models: 1) we train seven monolingual models
from scratch (186M to 13B parameters) dubbed FinGPT, 2) we continue the
pretraining of the multilingual BLOOM model on a mix of its original training
data and Finnish, resulting in a 176 billion parameter model we call BLUUMI.
For model evaluation, we introduce FIN-bench, a version of BIG-bench with
Finnish tasks. We also assess other model qualities such as toxicity and bias.
Our models and tools are openly available at https://turkunlp.org/gpt3-finnish.Comment: 17 pages (10 main), 7 figures, 5 table
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
Large language models (LLMs) have been shown to be able to perform new tasks
based on a few demonstrations or natural language instructions. While these
capabilities have led to widespread adoption, most LLMs are developed by
resource-rich organizations and are frequently kept from the public. As a step
towards democratizing this powerful technology, we present BLOOM, a
176B-parameter open-access language model designed and built thanks to a
collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer
language model that was trained on the ROOTS corpus, a dataset comprising
hundreds of sources in 46 natural and 13 programming languages (59 in total).
We find that BLOOM achieves competitive performance on a wide variety of
benchmarks, with stronger results after undergoing multitask prompted
finetuning. To facilitate future research and applications using LLMs, we
publicly release our models and code under the Responsible AI License
The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI
The race to train language models on vast, diverse, and inconsistently documented datasets has raised pressing concerns about the legal and ethical risks for practitioners. To remedy these practices threatening data transparency and understanding, we convene a multidisciplinary effort between legal and machine learning experts to systematically audit and trace 1800+ text datasets. We develop tools and standards to trace the lineage of these datasets, from their source, creators, series of license conditions, properties, and subsequent use. Our landscape analysis highlights the sharp divides in composition and focus of commercially open vs closed datasets, with closed datasets monopolizing important categories: lower resource languages, more creative tasks, richer topic variety, newer and more synthetic training data. This points to a deepening divide in the types of data that are made available under different license conditions, and heightened implications for jurisdictional legal interpretations of copyright and fair use. We also observe frequent miscategorization of licenses on widely used dataset hosting sites, with license omission of 70%+ and error rates of 50%+. This points to a crisis in misattribution and informed use of the most popular datasets driving many recent breakthroughs. As a contribution to ongoing improvements in dataset transparency and responsible use, we release our entire audit, with an interactive UI, the Data Provenance Explorer, which allows practitioners to trace and filter on data provenance for the most popular open source finetuning data collections: www.dataprovenance.org
StarCoder: may the source be with you!
The BigCode community, an open-scientific collaboration working on the
responsible development of Large Language Models for Code (Code LLMs),
introduces StarCoder and StarCoderBase: 15.5B parameter models with 8K context
length, infilling capabilities and fast large-batch inference enabled by
multi-query attention. StarCoderBase is trained on 1 trillion tokens sourced
from The Stack, a large collection of permissively licensed GitHub repositories
with inspection tools and an opt-out process. We fine-tuned StarCoderBase on
35B Python tokens, resulting in the creation of StarCoder. We perform the most
comprehensive evaluation of Code LLMs to date and show that StarCoderBase
outperforms every open Code LLM that supports multiple programming languages
and matches or outperforms the OpenAI code-cushman-001 model. Furthermore,
StarCoder outperforms every model that is fine-tuned on Python, can be prompted
to achieve 40\% pass@1 on HumanEval, and still retains its performance on other
programming languages. We take several important steps towards a safe
open-access model release, including an improved PII redaction pipeline and a
novel attribution tracing tool, and make the StarCoder models publicly
available under a more commercially viable version of the Open Responsible AI
Model license